Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks
The widespread adoption of self-attention (i.e. the Transformer model) and BERT-like training principles has recently produced a number of high-performing models on a wide panoply of vision-and-language problems (such as Visual Question Answering (VQA), image retrieval, etc.). In this paper we claim that these State-Of-The-Art (SOTA) approaches perform reasonably well at structuring information inside a single modality but, despite their impressive performance, tend to struggle to identify fine-grained inter-modality relationships. Indeed, such relations are frequently assumed to be implicitly learned during training from application-specific losses, mostly cross-entropy for classification. While most recent works provide inductive bias for inter-modality relationships via cross-attention modules, in this work we demonstrate (1) that the latter assumption does not hold, i.e. modality alignment does not necessarily emerge automatically, and (2) that adding weak supervision for alignment between visual objects and words improves the quality of the learned models on tasks requiring reasoning. In particular, we integrate an object-word alignment loss into SOTA vision-language reasoning models and evaluate it on two tasks: VQA and Language-driven Comparison of Images. We show that the proposed fine-grained inter-modality supervision significantly improves performance on both tasks. In particular, this new learning signal allows obtaining SOTA-level performance on the GQA dataset (VQA task) with pre-trained models without fine-tuning on the task, and a new SOTA on the NLVR2 dataset (Language-driven Comparison of Images). Finally, we illustrate the impact of this contribution on the models' reasoning by visualizing attention distributions.
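To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of an object-word alignment loss added on top of a task loss. The tensor shapes, the cosine-similarity scoring, and the weak `align_targets` labels are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(word_feats, obj_feats, align_targets):
    """word_feats: (B, T, D) word embeddings from the language stream.
    obj_feats: (B, K, D) region embeddings from the visual stream.
    align_targets: (B, T, K) weak 0/1 float labels for word-object pairs.
    """
    # Cosine-similarity alignment score between every word and every object.
    w = F.normalize(word_feats, dim=-1)
    o = F.normalize(obj_feats, dim=-1)
    scores = torch.einsum("btd,bkd->btk", w, o)
    # Binary cross-entropy against the weak alignment labels.
    return F.binary_cross_entropy_with_logits(scores, align_targets)

# Hypothetical usage: combine with the task loss, weighted by lambda_align.
# total_loss = task_loss + lambda_align * alignment_loss(w, o, targets)
```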
Estimating semantic structure for the VQA answer space
Since its appearance, Visual Question Answering (VQA, i.e. answering a question posed over an image) has always been treated as a classification problem over a set of predefined answers. Despite its convenience, this classification approach poorly reflects the semantics of the problem: it limits answering to a choice among independent proposals, without taking into account the similarity between them (e.g. penalizing the answers cat and German shepherd equally when the ground truth is dog). We address this issue by proposing (1) two measures of proximity between VQA classes, and (2) a corresponding loss which takes the estimated proximity into account. This significantly improves the generalization of VQA models by reducing their language bias. In particular, we show that our approach is completely model-agnostic, yielding consistent improvements with three different VQA models. Finally, by combining our method with a language bias reduction approach, we report SOTA-level performance on the challenging VQAv2-CP dataset.
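As a sketch of how such a proximity-aware loss might look, the snippet below softens the usual one-hot cross-entropy target with a class-proximity matrix, so near-miss answers are penalized less. The matrix `sim` and its construction are assumptions for illustration; the paper proposes two specific proximity measures that may differ.

```python
import torch
import torch.nn.functional as F

def proximity_loss(logits, target, sim):
    """logits: (B, C) answer scores; target: (B,) class indices;
    sim: (C, C) class-proximity matrix with sim[i, i] == 1.
    """
    soft_target = sim[target]  # (B, C): one proximity row per sample
    soft_target = soft_target / soft_target.sum(-1, keepdim=True)
    # Cross-entropy against the softened (proximity-weighted) target.
    return -(soft_target * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Toy example with C = 4 answers, where class 0 ("dog") is close to
# class 1 ("German shepherd"); the similarity values are made up.
sim = torch.tensor([[1.0, 0.8, 0.1, 0.1],
                    [0.8, 1.0, 0.1, 0.1],
                    [0.1, 0.1, 1.0, 0.2],
                    [0.1, 0.1, 0.2, 1.0]])
loss = proximity_loss(torch.randn(2, 4), torch.tensor([0, 2]), sim)
```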
How Transferable are Reasoning Patterns in VQA?
Since its inception, Visual Question Answering (VQA) has been notorious as a task where models are prone to exploiting dataset biases to find shortcuts instead of performing high-level reasoning. Classical methods address this by removing biases from the training data or by adding branches to models to detect and remove biases. In this paper, we argue that uncertainty in vision is a dominating factor preventing the successful learning of reasoning in vision-and-language problems. We train a visual oracle and, in a large-scale study, provide experimental evidence that it is much less prone to exploiting spurious dataset biases than standard models. We study the attention mechanisms at work in the visual oracle and compare them with those of a SOTA Transformer-based model. We provide an in-depth analysis and visualizations of the reasoning patterns obtained with an online visualization tool, which we make publicly available (https://reasoningpatterns.github.io). We exploit these insights by transferring reasoning patterns from the oracle, via fine-tuning, to a SOTA Transformer-based VQA model that takes standard noisy visual inputs. In experiments we report higher overall accuracy, as well as higher accuracy on infrequent answers for each question type, which provides evidence for improved generalization and a decreased dependency on dataset biases.
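A minimal, self-contained sketch of the two-phase transfer recipe described above, under the assumption that transfer is done by pre-training on oracle (clean) visual inputs and then fine-tuning the same weights on noisy detector features with a smaller learning rate. The toy classifier, feature dimensions, learning rates, and random stand-in data are placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a VQA answer classifier over 2048-d visual features.
model = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 3000))

def run_phase(model, batches, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for feats, answers in batches:
        opt.zero_grad()
        loss_fn(model(feats), answers).backward()
        opt.step()

# Phase 1: learn reasoning patterns on oracle inputs (e.g. features derived
# from ground-truth object annotations); random tensors stand in for data.
oracle_batches = [(torch.randn(8, 2048), torch.randint(0, 3000, (8,)))]
run_phase(model, oracle_batches, lr=1e-4)

# Phase 2: transfer by fine-tuning on standard noisy detector features,
# with a smaller learning rate to preserve the learned patterns.
noisy_batches = [(torch.randn(8, 2048), torch.randint(0, 3000, (8,)))]
run_phase(model, noisy_batches, lr=1e-5)
```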